Hello everyone! Welcome to a new blog post where we explore the exciting world of Bayesian statistics and its applications in sports analytics. Today, I’m thrilled to share with you a fascinating project that combines the power of Bayesian inference with the Markov Chain Monte Carlo (MCMC) method to predict the outcomes of soccer matches in the Leagues Cup 2024. By using a Poisson-Gamma model, we can examine the intricacies of soccer match predictions and provide insights into the expected performance of various teams. ⚽
In my previous blog post, I discussed Bayesian statistics and its importance in machine learning, emphasizing why it’s essential to think like a Bayesian. The flexibility and robustness of Bayesian methods allow for better handling of uncertainty and incorporating prior knowledge into models, making them highly effective in various applications, including sports analytics and predictions.
About
The Leagues Cup features teams from Liga MX and MLS playing on American soil. All 47 teams from their respective leagues in Mexico, the United States, and Canada have the chance to compete in this World Cup-style tournament sanctioned by Concacaf.
Format
The Leagues Cup is the first two-league tournament sanctioned by Concacaf.
It features 77 matches in one month, hosted at MLS stadiums in the U.S. and Canada. 29 MLS clubs and 18 Liga MX clubs participate each year.
The tournament has a World Cup-style format:
Group Stage with East and West regions
Knockout Rounds until the champion is crowned
The top three teams get automatic bids to the Concacaf Champions Cup.
These teams have a chance to compete in the FIFA Club World Cup.
Point System
Regulation wins: 3 points
No ties!
Tied after 90 minutes: 1 point each
Penalty shootout winner: 1 additional point
Group Stage
Two teams bypass the group stage and go directly to the Round of 32:
Columbus Crew (2023 MLS Cup winner)
Club América (most points in Clausura 2023 and Apertura 2023)
Each club plays at least 2 matches in the Group Stage.
The top two teams from each group advance to the Round of 32.
Group Assignments
In 2024, a tiered Leagues Cup Ranking system was introduced:
Clubs are ranked based on their performance in the last 34 matches.
MLS teams use the 2023 MLS Supporters’ Shield standings.
LIGA MX clubs use cumulative points from Clausura 2023 and Apertura 2023.
Clubs are divided into three tiers (1-15, 16-30, 31-45).
There are 15 groups of three clubs each, one from each tier.
Groups are formed based on geographical and competitive balance.
The LIGA MX Champion and the top three LIGA MX clubs receive hub privileges:
Reduced travel for LIGA MX clubs.
Play as the home team in predetermined venues.
Knockout Rounds
Single game elimination to move to the next round.
Knockout rounds begin with the Round of 32 (16 matches).
Winners from the Round of 32 advance to the Round of 16 (8 matches).
Winners from the Round of 16 advance to the Quarterfinals (8 teams).
Winners of the Quarterfinals advance to the Semifinals (4 teams).
Losers of the Semifinals compete in a Third Place Match.
Winners of the Semifinals compete in the Leagues Cup Final on Sunday, August 25.
Tournament Rewards
The winner of the Third Place Match qualifies for the Concacaf Champions Cup.
Both clubs in the Leagues Cup Final automatically qualify for the Concacaf Champions Cup.
The Leagues Cup 2023 Champion qualifies directly to the Concacaf Champions Cup Round of 16.
To start the tournament, we will first analyze both leagues’ information about the distribution of goals.
Goal Proportion Distribution
Here the data used in this analysis was acquired from Liga MX and MLS match results as of July 27th, 2024, from their respective league matches.
To illustrate the goal distribution in Liga MX and MLS, here is a chart showing the proportion of matches with different numbers of goals:
Understanding these distributions is essential for modeling and predicting match outcomes as they provide a foundational understanding of how often teams score goals in each league.
Overview of the Poisson Distribution
Here the data used in this analysis was acquired from Liga MX and MLS match results as of July 27th, 2024, from their respective league matches.
Additionally, I provided an overview of the Poisson distribution which is commonly used to model count data such as the number of goals scored by a team in a match. The Poisson distribution is characterized by a single parameter, \(\lambda\), representing the average rate of occurrences (goals) within a fixed interval (a match). For a refresher, the probability mass function (PMF) of the Poisson distribution is:
\[P(Y = y | \lambda) = \frac{\lambda^y e^{-\lambda}}{y!}\]
where:
- \(Y\) is the random variable representing the number of goals.
- \(\lambda\) is the rate parameter (average number of goals).
- \(y\) is the actual observed count of goals.
Understanding the Poisson distribution is crucial as it forms the basis for our Poisson-Gamma model used in this project.
Comparing Observed and Poisson-Modeled Goals
To further illustrate our analysis, we have included charts comparing observed and Poisson-modeled goals for teams in Liga MX and MLS. These comparisons highlight how well the Poisson model fits the observed goal distributions for each team.
Liga MX Teams
MLS Teams
The charts above compare the observed goals (darker bars) and the Poisson-modeled goals (lighter bars) for each team in Liga MX and MLS. Each subplot represents a different team, showing the probability of scoring a certain number of goals in a match.
Key points:
Teams with higher \(\lambda\) values tend to have a higher probability of scoring more goals in a match.
The fit of the Poisson model varies across teams, indicating differences in scoring patterns and consistency.
These visualizations help us understand the effectiveness of the Poisson model in capturing the goal-scoring behavior of different teams, which is crucial for accurate predictions.
Model Formulation
Let \(\lambda\) be an unknown rate parameter and \((Y_1, Y_2, \ldots, Y_n)\) be an independent Poisson(\(\lambda\)) sample. The Gamma-Poisson Bayesian model complements the Poisson structure of data \(Y\) with a Gamma prior on \(\lambda\):
\[Y_i \mid \lambda \sim \text{Poisson}(\lambda)\]
\[\lambda \sim \text{Gamma}(s, r)\]
Upon observing data \(\mathbf{y} = (y_1, y_2, \ldots, y_n)\), the posterior distribution of \(\lambda\) is given by:
\[ f(\lambda \mid \mathbf{y}) \propto f(\lambda) L(\lambda \mid \mathbf{y}) = \frac{r^s}{\Gamma(s)} \lambda^{s-1} e^{-r \lambda} \cdot \frac{\lambda^{\sum y_i} e^{-\lambda n}}{\prod y_i!}, \quad \text{for } \lambda > 0 \]
Next, remember that any non-\(\lambda\) multiplicative constant in the above equation can be “proportional-ed” out. Thus, boiling the prior pdf and likelihood function down to their kernels, we get:
\[ f(\lambda \mid \mathbf{y}) \propto \lambda^{s-1} e^{-r\lambda} \cdot \lambda^{\sum y_i} e^{-n\lambda} \]
This simplifies to:
\[ f(\lambda \mid \mathbf{y}) \propto \lambda^{s + \sum y_i - 1} e^{-(r + n)\lambda} \]
This results in the posterior distribution:
\[ \lambda \mid \mathbf{y} \sim \text{Gamma} \left( s + \sum_{i=1}^n y_i, r + n \right) \]
where:
- \(f(\lambda)\) is the prior distribution,
- \(L(\lambda \mid \mathbf{y})\) is the likelihood function,
- \(\Gamma(s)\) is the Gamma function,
- \(s\) and \(r\) are the shape and rate parameters of the prior Gamma distribution.
Bayesian Inference with MCMC
Markov Chain Monte Carlo (MCMC) methods are used to draw samples from the posterior distribution of the model parameters. In this post, we will use the Metropolis-Hastings algorithm to perform MCMC sampling.
Metropolis-Hastings Algorithm
The Metropolis-Hastings algorithm is used to create a Markov chain with the sequence \(\{\mu^{(1)}, \mu^{(2)}, \ldots, \mu^{(N)}\}\). Below, we will explain the details of this algorithm and how to apply it in the context of a Normal-Normal Bayesian model.
Given data \(y\), assume the parameter \(\mu\) has a posterior probability density function (pdf) \(f(\mu|y)\) proportional to \(f(\mu)L(\mu|y)\). A Metropolis-Hastings Markov chain for \(f(\mu|y)\) evolves as follows. Let \(\mu^{(i)} = \mu\) represent the chain’s position at iteration \(i \in \{1, 2, \ldots, N-1\}\). The next position, \(\mu^{(i+1)}\), is determined through the following two steps:
Step 1: Propose a new location. Given the current position \(\mu\), draw a new position \(\mu'\) from a proposal distribution with pdf \(q(\mu'|\mu)\).
Step 2: Decide whether to move to the new location.
Compute the acceptance probability, which is the probability of accepting the proposed new position \(\mu'\):
\[ \alpha = \min \left\{ 1, \frac{f(\mu')L(\mu'|y) \cdot q(\mu|\mu')}{f(\mu)L(\mu|y) \cdot q(\mu'|\mu)} \right\} \]
In practical terms, this means flipping a weighted coin. If the result is heads (with probability \(\alpha\)), move to the proposed position \(\mu'\). If the result is tails (with probability \(1 - \alpha\)), stay at the current position \(\mu\):
\[ \mu^{(i+1)} = \begin{cases} \mu' & \text{with probability } \alpha \\ \mu & \text{with probability } 1 - \alpha \end{cases} \]
Posterior Predictive Model
The posterior predictive model allows us to make predictions about new data based on the posterior distribution of the parameters. Let’s use the Poisson-Gamma model as an example. Suppose \(y\) represents the number of events occurring in a fixed period, which follows a Poisson distribution with rate parameter \(\lambda\). The prior distribution for \(\lambda\) is assumed to be Gamma distributed with shape parameter \(s\) and rate parameter \(r\).
The posterior distribution of \(\lambda\), given observed data \(y\), is also a Gamma distribution, specifically \(\text{Gamma}(s + \sum y_i, r + n)\).
Let \(y'\) denote a new outcome of the variable \(y\). The posterior predictive distribution of \(y'\) given the observed data \(y\) can be derived by integrating over the posterior distribution of \(\lambda\):
\[ f(y' \mid y) = \int f(y' \mid \lambda) f(\lambda \mid y) d\lambda \]
Practical Example: Posterior Analysis for Club America
To illustrate this method, let’s focus on Club America from Mexico.
Data Collection and Model Fitting
We collected historical data on the number of goals scored by Club America in recent matches. Using this data, we fit the Poisson-Gamma model to estimate the rate parameter (\(\lambda\)) for Club America. The data is relevant as of July 27th, 2024, and we performed 4000 tests using prior parameters based on historical data from both Liga MX and MLS.
Posterior Analysis
After running the MCMC sampling, we obtained the posterior distribution of \(\lambda\) for Club America. Below, we present the trace plots and density plots for the posterior samples of \(\lambda\) along with numerical diagnostics.
Trace Plots
The trace plots show the values of \(\lambda\) sampled by the MCMC algorithm over iterations, allowing us to assess the mixing and convergence of the chain.
The trace plots for Lambda across four MCMC chains show the goals (Lambda) over 1000 iterations. There are no obvious patterns in the trace plots, indicating that the chains are well-mixed and stable. This suggests that the MCMC process is sampling efficiently from the posterior distribution, without strong dependencies between successive samples.
Density Plots
The density plots display the distribution of the posterior samples for \(\lambda\), providing insights into the central tendency and spread of the estimated goal rate.
Key Points:
The histogram (top) indicates that most simulations result in a moderate number of goals, with fewer simulations predicting very high or very low goal counts.
The overall density plot (bottom left) provides a smooth estimate of this distribution, confirming the bell-shaped curve observed in the histogram.
The density plots for four MCMC chains (bottom right) demonstrate consistency across chains, suggesting that the MCMC process is stable and well-mixed, producing reliable and consistent samples.
Numerical Diagnostics
We can supplement these visual diagnostics with numerical diagnostics such as effective sample size ratio, autocorrelation, and R-hat:
- Effective Sample Size Ratio:
The effective sample size ratio for Lambda is 0.437. This ratio quantifies the number of independent samples in relation to the total number of samples. A ratio less than 0.1 might be concerning, but a ratio of 0.437 indicates moderate sampling efficiency. The effective sample size ratio \((\frac{N_{eff}}{N})\) measures the accuracy of the Markov chain approximation relative to an independent sample.
- Auto-correlation:
Autocorrelation provides another metric by which to evaluate whether our Markov chain sufficiently mimics the behavior of an independent sample. The auto-correlation plots help us assess the independence of the samples, indicating how quickly the chain mixes.
The autocorrelation plots for Lambda across four MCMC chains illustrate how the autocorrelation decreases with increasing lag. The lag 0 autocorrelation is naturally 1, while the lag 1 autocorrelation is around 0.5, indicating moderate correlation among chain values that are one step apart. Beyond lag 5, the autocorrelation is effectively zero, implying very little correlation between chain values that are more than a few steps apart. This is a good sign, confirming that the Markov chains are mixing quickly and efficiently.
- R-hat:
The \(\hat{R}\) value for Lambda is 1.0008, which is very close to 1, suggesting that the chains have converged well. The R-hat metric \((\hat{R})\) calculates the ratio of variability between combined chains and within individual chains, ideally close to 1, indicating stability. An R-hat value greater than 1.05 raises concerns about chain convergence.
Posterior Summaries and Results
We summarized the posterior samples of \(\lambda\) to provide point estimates and credible intervals. These summaries offer a concise representation of the estimated goal rate for Club America.
The table above summarizes key posterior metrics such as the mean, median, mode, and the 95% credible interval (lower and upper bounds). These summaries provide a comprehensive view of the posterior distribution, highlighting the central tendency and the variability of the predicted goals for Club America.
Now, we’ve used our MCMC simulation to approximate the posterior predictive model of \(y'\) the number of goals Club America is expected to score in their next match. The punchline is this: MCMC worked. The approximations are quite accurate, providing confidence in our predictions.
The table above shows the comparison between the posterior value and the MCMC approximation for the expected goals of Club America. Key metrics such as the mean, mode, and percentiles (2.5th and 97.5th) are included, confirming that the MCMC method provides accurate approximations of the posterior distribution and reinforcing the reliability of our predictions.
Posterior Predictions
Using the posterior predictive model, we simulated the expected number of goals for Club America in an upcoming match. This involved generating 4000 new data points based on the posterior distribution of \(\lambda\).
The histogram above shows the distribution of simulated goals for Club America. The predicted number of goals for an upcoming match is approximately 1.5 goals, based on the posterior predictive model.
Predictions for All Teams
Again, using the posterior predictive model, we simulated the expected number of goals for all teams in their upcoming matches with 4000 runs. This extensive simulation, based on the current state of the tournament as of July 27, 2024, allows us to delve further into making projections and prediction probabilities.
Results
The following table chart shows the prediction probabilities for all teams in the Leagues Cup 2024, categorized by their groups and positions (1st, 2nd, Round of 32, and Points). 📊
Key Points:
Orlando City has the highest probability of making it to the last 32 (100%) with the highest expected points (5.21).
Inter Miami shows strong probabilities with a 69% chance of finishing 1st and a 99% chance of making it to the last 32.
Seattle leads in West 6 with a 67% chance of finishing 1st and a 98% chance of making it to the last 32 with 4.64 expected points.
Tijuana has the most challenging situation in West 7 with a 0% chance of finishing 1st and only a 17% chance of making it to the last 32.😞
League Advancements
Just by looking at the plot projections:
It appears that 8 out of the 17 teams from Liga MX will advance to the next round, which is approximately 47%.
Additionally, 22 out of the 28 teams from Major League Soccer (MLS) are projected to advance to the next round, which is about 79%.
Conclusion
In this post, we explored using Bayesian MCMC methods with a Poisson-Gamma model to predict the outcomes of soccer matches in the Leagues Cup 2024. By harnessing the power of Bayesian inference, we can integrate prior knowledge and refine our predictions as new data is gathered. This approach offers a versatile and reliable framework for sports analytics.
That’s it for now! Stay tuned for more insights and analyses as the tournament progresses, and don’t hesitate to share your thoughts and predictions in the comments! 😎